Abstract

Background: Metagenomics, based on culture-independent sequencing, is a well-fitted approach to provide 
insights into the composition, structure and dynamics of environmental viral communities. Following recent 
advances in sequencing technologies, new challenges arise for existing bioinformatic tools dedicated to viral 
metagenome (i.e. virome) analysis as (i) the number of viromes is rapidly growing and (ii) large genomic fragments
can now be obtained by assembling the huge amount of sequence data generated for each metagenome. 

Results: To face these challenges, a new version of Metavir was developed. First, all Metavir tools have been 
adapted to support comparative analysis of viromes in order to improve the analysis of multiple datasets. In 
addition to the sequence comparison previously provided, viromes can now be compared through their k-mer 
frequencies, their taxonomic compositions, recruitment plots and phylogenetic trees containing sequences from 
different datasets. Second, a new section has been specifically designed to handle assembled viromes made of 
thousands of large genomic fragments (i.e. contigs). This section includes an annotation pipeline for uploaded viral
contigs (gene prediction, similarity search against reference viral genomes and protein domains) and an extensive 
comparison between contigs and reference genomes. Contigs and their annotations can be explored on the 
website through specifically developed dynamic genomic maps and interactive networks. 

Conclusions: The new features of Metavir 2 allow users to explore and analyze viromes composed of raw reads or 
assembled fragments through a set of adapted tools and a user-friendly interface.

Background

Viruses are the most abundant biological entities in the 
biosphere [1] and are now considered as major players 
in natural ecosystems and their associated cycles and 
balances [2,3]. Viral communities are known to be 
mostly composed of new strains [4-6] and are difficult to 
characterize as (i) most micro-organisms cannot yet be
cultivated in the lab, hence preventing
the culture, isolation and study of their associated 
viruses and (ii) the absence of a single gene common to 
all viral genomes prevents the monitoring of uncultured 
viral diversity using approaches analogous to ribosomal 
DNA profiling.

Metagenomic approaches, which consist of random
sequencing of the genetic pool isolated from natural
samples, circumvent these limitations. Experimental protocols
to extract and isolate the encapsidated fraction are now
well established [7-9], and viral metagenomes (i.e. viromes)
have been generated from a broad range of ecosystems.
Beyond the description and characterization of the viral
genomic diversity, viromes are useful for addressing more
general questions such as biogeography and dispersion of viral
particles [10,11], evolution and origin of viruses [12] or 
epidemiology [13]. 

Advances in next-generation sequencing and in sequence 
assembly techniques recently led viral metagenomics a step 
further, by providing access to large genomic fragments 
rather than only short reads [14-16]. Indeed, contigs re- 
presenting complete or near-complete viral genomes were 
assembled from 454 [17-20] and Illumina HiSeq [21-23] 
generated viromes. These large assembled sequences 
(several Kb or tens of Kb, depending on the diversity of the
viral community studied) provide access to the genome
content and architecture of uncultured viruses and offer 
the possibility to gain unique insights into the main viral 
families in the environment. 

Two web-servers are currently available for a compre- 
hensive virome analysis: Metavir [24], and Virome [25]. 
A pipeline (the Viral Metagenome Affiliation Pipeline 
[26]) was also described but, to our knowledge, is
available neither as standalone software nor through a
web page. Yet, none of these bioinformatic tools were
designed for the analysis of assembled datasets and the 
absence of adapted tools for such assembled viromes 
was pinpointed as a major bottleneck for viral metage- 
nomic studies [25,27]. Moreover, the growing number of 
generated viromes calls for the development of compari- 
son strategies to go beyond individual analysis of each 
dataset. Here, we introduce a new version of Metavir 
that tackles these two limitations. Metavir 2 includes (i) 
new ways to compare datasets and (ii) a whole new sec- 
tion which forms the first tool designed for a compre- 
hensive analysis of assembled virome sequences. 

Implementation 

Input and metadata 

Registered users can upload their own sequence data- 
sets, either short reads or assembled contigs, in a private 
space. Input data are checked for being only composed 
of DNA sequences in fasta format (compressed files in 
zip, gzip or tar.gz format are accepted). Due to the size
of raw Illumina datasets (~50 Gb) and the computing time
required for assembling each dataset, the assembly step
cannot be computed through Metavir. Furthermore, a
wide range of software is available for this step and
the choice depends on the type of the sequencing and 
the nature of the sample: Newbler (454 Life Sciences) is 
the main software used so far for 454 data [20,28,29], 
and Illumina data can be assembled with Idba_ud [15], 
SOAP [30], MetaVelvet [31] or OptiDBA [16]. 
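
The input check described above (DNA sequences only, in fasta format) can be illustrated with a minimal Python sketch. This is not the code used by Metavir: the function name `is_dna_fasta` and the accepted alphabet (A, C, G, T and N) are assumptions made for the example.

```python
from typing import Iterable

# Assumption: only unambiguous bases plus N are accepted as DNA.
DNA_ALPHABET = set("ACGTN")


def is_dna_fasta(lines: Iterable[str]) -> bool:
    """Return True if `lines` form a non-empty, DNA-only fasta file."""
    seen_header = False
    has_sequence = False  # whether the current record has any residue
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated
        if line.startswith(">"):
            if seen_header and not has_sequence:
                return False  # the previous record had no sequence
            seen_header, has_sequence = True, False
        else:
            if not seen_header:
                return False  # sequence data before any header
            if not set(line.upper()) <= DNA_ALPHABET:
                return False  # character outside the DNA alphabet
            has_sequence = True
    return seen_header and has_sequence
```

A protein fasta file, for example, is rejected by such a check because residues such as M or K fall outside the accepted alphabet.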

A set of public viromes is also already available for users 
to compare with their dataset(s). These viromes are sorted 
into projects, and linked to the manuscript describing 
their analysis when available. Various metadata can be 
added, such as the type of sample from which the virome 
was sequenced, the location, depth, and temperature of 
sampling point, and the sequencing technology used to 
generate the dataset. 

Section 1: tools to analyze raw datasets (unassembled reads)
Taxonomic composition 

Virome reads are first compared to the complete viral 
genomes of the RefSeq Virus database using BLAST. 
The taxonomic composition is then determined using 
either raw number of best hits or number of best hits 
normalized by genome length using GAAS [32]. Krona [33]
is now used to generate interactive charts representing
taxonomic composition of one or more viromes. A
custom-designed JavaScript program has also been
implemented to visualize these compositions as interactive
heatmaps, with each column representing a dataset and 
each row a group of viral species. Columns can be 
switched by mouse drag and drop. Viral species are clas- 
sified according to the up-to-date NCBI taxonomy, and 
viral groups can be folded and unfolded with a mouse 
click. 
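
The difference between the two counting schemes can be sketched in Python as follows. This is an illustration only: GAAS [32] implements a more elaborate procedure, and the simple division by genome length used here, like the function name `composition`, is an assumption made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def composition(best_hits: Iterable[str],
                genome_lengths: Optional[Dict[str, int]] = None) -> Dict[str, float]:
    """Relative abundance of each reference genome among best BLAST hits.

    best_hits: one reference genome name per affiliated read.
    genome_lengths: if given, hit counts are divided by genome length
    (in nt) before being converted to proportions.
    """
    counts = Counter(best_hits)
    if genome_lengths is None:
        weights = {g: float(n) for g, n in counts.items()}
    else:
        weights = {g: n / genome_lengths[g] for g, n in counts.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

With four reads assigned to a 40-kb genome and two reads to a 10-kb genome, the raw proportions are 2/3 and 1/3, whereas the length-normalized proportions are 1/3 and 2/3: the smaller genome is sampled less often per particle, so normalization raises its estimated share.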

k-mer frequency bias 

A virome comparison based on k-mer frequency bias 
(di-, tri- and tetranucleotides are available) has been
implemented as described by Willner and collaborators [34].
Unlike the other available comparison method, which is based on
sequence similarity (generated using reciprocal tBLASTx)
and requires datasets containing at least 50,000
sequences of 100 bp, k-mer nucleotide frequencies can be
computed for all datasets without size restriction. Briefly,
k-mer frequency distribution biases are computed by a
custom Perl script and then compared for each pair of
viromes. Pairwise Euclidean distances between viromes are
stored in a matrix, which can be used as input for either
hierarchical clustering or non-metric multidimensional
scaling. Both analyses are computed with R [35] using the
pvclust [36] and vegan [37] libraries, respectively. The
non-metric multidimensional scaling (NMDS) is now also
available for the virome comparison based on sequence
similarities that was already provided in Metavir 1.
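
The distance computation described above can be sketched in Python. This is not the Perl script used by Metavir: for simplicity it uses plain relative k-mer frequencies rather than the bias measure of Willner et al. [34], and all function names are hypothetical.

```python
from itertools import product
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def kmer_frequencies(seqs: Sequence[str], k: int = 4) -> List[float]:
    """Relative frequency of every k-mer over A/C/G/T, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for seq in seqs:
        s = seq.upper()
        for i in range(len(s) - k + 1):
            word = s[i:i + k]
            if word in counts:  # words containing N or other symbols are skipped
                counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two frequency profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(viromes: Dict[str, Sequence[str]],
                    k: int = 4) -> Tuple[List[str], List[List[float]]]:
    """Pairwise distances between viromes (virome name -> list of reads)."""
    names = sorted(viromes)
    profiles = {n: kmer_frequencies(viromes[n], k) for n in names}
    matrix = [[euclidean(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix
```

The resulting matrix plays the role of the input passed to the hierarchical clustering or NMDS step; because each profile is computed per virome, no minimum dataset size is required.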

Phylogenetic analyses 

To speed up the phylogenetic pipeline, phylogenetic 
trees are now computed with FastTree [38]. Using the 
jsPhyloSVG JavaScript plugin [39], phylogenetic trees
are now interactive: they can be displayed as circular or
linear, subtrees can be merged, and information on the
origin and affiliation of the sequence of each node can 
be obtained by clicking on the associated leaf. 

Individual viral genome recruitment plots 

Using the best BLAST hit results against RefSeq Virus,
each virome sequence with a hit is affiliated to a unique 
viral genome, i.e. each read is recruited by a reference 
virus. For any selected viral genome, two types of re- 
cruitment plots are then available: (i) a scatter plot dis- 
playing each recruited read as a dot depending on the 
position on the genome (on the x-axis) and the identity 
percentage of the BLAST hit (on the y-axis), and (ii) a
histogram presenting the number of recruited reads for
each 500-nt long genome part. These plots are generated 
using the ggplot2 R library [40]. Additional viromes that 
contain sequences recruited by the selected genomes are 
also listed and can be added to the current plot. When
several datasets are selected, a color is attributed to each
virome and used to color dots (in scatter plots) or stacked
bars (in histograms).

Section 2: assembled viromes annotation and display 
Contig annotation 

Open reading frames (ORFs) are first predicted for each 
contig through MetaGeneAnnotator [41]. A custom Perl 
script was designed to detect circular contigs by looking
for identical k-mers at the two ends of the sequences.
Each circular contig is then trimmed to remove all
redundant parts. In order to predict genes spanning
the origin of circular contigs, a temporary version
of each circular contig is used in the ORF prediction
software, in which the first 1,000 nucleotides are duplicated
and added at the contig's end. Note that
this detection of circular contigs will not be effective for
contigs computed with assemblers such as Newbler, which
already detect and remove such similarity between contig
ends.
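
The circularity test, the trimming step and the temporary extension can be sketched in Python as follows. This is not the Perl script used by Metavir; the k-mer length (k = 10 here) is not given in the text and is an assumption, as are the function names.

```python
from typing import Tuple


def find_terminal_overlap(seq: str, k: int = 10) -> int:
    """Length of the longest end of `seq` that repeats its start (0 if none).

    The first k-mer is searched for downstream; a match counts only if the
    whole stretch from that position to the end equals the contig start.
    """
    if len(seq) < 2 * k:
        return 0
    seed = seq[:k]
    start = seq.find(seed, k)
    while start != -1:
        overlap = len(seq) - start
        if seq.endswith(seq[:overlap]):
            return overlap
        start = seq.find(seed, start + 1)
    return 0


def trim_circular(seq: str, k: int = 10) -> Tuple[str, bool]:
    """Remove the redundant end of a circular contig; flag circularity."""
    overlap = find_terminal_overlap(seq, k)
    if overlap == 0:
        return seq, False
    return seq[:len(seq) - overlap], True


def extend_for_orf_prediction(circular_seq: str, extension: int = 1000) -> str:
    """Temporary copy with the first `extension` nt appended at the end."""
    return circular_seq + circular_seq[:extension]
```

Genes predicted on the extended copy can then be mapped back to the trimmed contig, so that an ORF crossing the origin is reported once.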

All predicted translated ORFs are then compared to
several databases, namely the RefSeq Virus protein database
from the NCBI using BLASTp [42], with a threshold of
10⁻³ on e-value, and the PFAM database of protein
domains (version 26.0; [43]) using HMMScan [44], with a
threshold of 30 on score. A direct comparison of ORFs
within a virome is also computed through BLASTp with
the same threshold of 10⁻³ on e-value.

The taxonomic composition and sequence diversity 
are not calculated the same way for datasets made of 
long genomic sequences compared to those made of 
short reads. Using the BLASTp results against reference
viruses, three types of taxonomic compositions are
computed for each dataset. These compositions are based on
(i) best BLAST hit affiliation of each predicted gene,
(ii) best BLAST hit affiliation of each contig, and (iii)
lowest common ancestor (LCA) affiliation of each contig. This
LCA affiliation is designed to take into account the
multiple hits on a single contig: up to five affiliated genes
(if available) are considered for each contig, and the
affiliation is made at the highest common taxonomy
level of the best BLAST hits from these selected genes.
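
The LCA affiliation can be sketched in Python. The text does not state how the (up to) five genes are chosen; selecting the genes with the highest bit-scores is an assumption made for this example, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Longest shared prefix of root-first taxonomic lineages."""
    if not lineages:
        return []
    common = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            common.append(ranks[0])
        else:
            break
    return common


def contig_lca(gene_hits: Sequence[Tuple[float, Sequence[str]]],
               max_genes: int = 5) -> List[str]:
    """LCA affiliation of a contig.

    gene_hits: one (bit_score, lineage) pair per affiliated gene, where
    lineage is the root-first taxonomy of the gene's best BLAST hit.
    Assumption: the `max_genes` genes with the highest bit-scores are used.
    """
    selected = sorted(gene_hits, key=lambda h: h[0], reverse=True)[:max_genes]
    return lowest_common_ancestor([lineage for _, lineage in selected])
```

A contig whose genes all hit the same family is affiliated to that family, whereas a contig with hits in two families of the same order is affiliated only at the order level.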

Finally, different clusterings of the predicted ORFs are 
computed. A global protein sequence clustering with 
three different thresholds (75, 90 and 98% of similarity) 
is performed using Uclust [45]. Another clustering is 
based on protein domain alignments: ORFs are first 
ordered by size, and used iteratively as a seed for a 
jackhmmer search [44]. All ORFs recruited by the seed 
are gathered in a cluster with this seed, and removed 
from further iterations. Once computed, the
domain-based ORF clusters are affiliated to one or more PFAM
domains based on the affiliation of their members. These
clusterings are displayed through the rarefaction curve
tool, and cluster affiliations can be downloaded as a csv
file.
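
The iterative, seed-based clustering can be sketched in Python. The jackhmmer search itself is replaced here by a user-supplied `recruit` function, so the example only illustrates the control flow; processing the longest ORF first is an assumption, since the text states only that ORFs are ordered by size.

```python
from typing import Callable, Dict, Iterable, List


def greedy_seed_clusters(
        orfs: Dict[str, str],
        recruit: Callable[[str, List[str]], Iterable[str]]) -> List[List[str]]:
    """Cluster ORFs by iterative seed recruitment.

    orfs: ORF name -> protein sequence.
    recruit(seed, pool): names in `pool` recruited by `seed`
    (a stand-in for a jackhmmer search seeded with that ORF).
    """
    # Assumption: the longest ORF is used as the first seed.
    remaining = sorted(orfs, key=lambda name: len(orfs[name]), reverse=True)
    clusters = []
    while remaining:
        seed, pool = remaining[0], remaining[1:]
        members = set(recruit(seed, pool)) & set(pool)
        clusters.append([seed] + [n for n in pool if n in members])
        # Recruited ORFs are removed from further iterations.
        remaining = [n for n in pool if n not in members]
    return clusters
```

Each ORF ends up in exactly one cluster, either as a seed or as a sequence recruited by an earlier seed.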

Contig display 

When an assembled virome is selected, a new "contig
maps" page now provides general information about
ORF prediction and contig affiliations, as well as an inset
that allows users to filter the contig list and access contigs of
interest for further analysis (contig maps and networks).
This interactive filter, developed using jQuery, lets users
select contigs based on taxonomic or functional affiliations
of predicted genes, and contig size, name or taxonomic
affiliation.

An interactive genomic map can be displayed for each 
contig, this map being drawn using RaphaelSVG and the 
Raphael-zpd plugin. Each gene affiliation to RefSeq viral
genomes and PFAM protein domains is indicated when 
available. Genes can be further investigated as nucleotide 
and protein sequences are displayed by clicking on the 
gene either on the map or in the gene table below. Contig
annotations can also be downloaded as csv tables,
summarized by contig or detailed for each ORF.

Similarities between contigs and viral genomes and 
between different contigs can be visualized as an
interactive network. In order to take into account all relevant
similarities and not only the best BLAST hit for each
ORF, all BLAST hits with an e-value lower than 10⁻³
and a bit-score within a 10% margin of the
best BLAST hit bit-score for this ORF are used to build
the contig network. In the resulting networks, created
with Cytoscape-web [46], contigs and reference genomes
are represented as nodes, and sequence similarities as
edges. Different options are available to customize the
network, such as the coloring of edges based on BLAST
bit-score, the display of only one edge between two similar
contigs or of one edge for each ORF similarity, or
the coloring of genome nodes based on the taxonomy.
Another set of filters is also proposed to reduce the 
number of nodes or edges displayed on screen. 
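
The selection of BLAST hits used to build the network can be sketched in Python. The record layout (dictionaries with `orf`, `subject`, `evalue` and `bitscore` keys) and the function name are hypothetical, and the sketch assumes that the best bit-score of an ORF is taken among the hits passing the e-value threshold.

```python
from typing import Dict, List


def select_network_hits(hits: List[Dict],
                        max_evalue: float = 1e-3,
                        margin: float = 0.10) -> List[Dict]:
    """Keep hits that pass the e-value threshold and whose bit-score lies
    within `margin` of the best bit-score of the same ORF."""
    best: Dict[str, float] = {}
    for h in hits:
        if h["evalue"] < max_evalue:
            best[h["orf"]] = max(best.get(h["orf"], 0.0), h["bitscore"])
    return [h for h in hits
            if h["evalue"] < max_evalue
            and h["bitscore"] >= (1 - margin) * best[h["orf"]]]
```

Each retained hit then corresponds to one potential edge between the contig carrying the ORF and the matched genome or contig.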

Associated with this network, a contig map comparison 
tool can be used to display collinearity between contigs 
and genomes or other contigs selected on the network. 
These comparisons are displayed through RaphaelSVG and
Raphael-zpd. The name and affiliation of each gene are
displayed when clicked, and a jQuery pop-up is used to
change the sequence order within the plot.

Common framework 
Automatic database update 

As the RefSeq Virus database is quickly growing (40 new
genomes are added on average every month), each new
release is automatically downloaded and used as the new
reference database. Taxonomic composition, gene
affiliation (for contig datasets), and recruitment plots of
public projects are automatically updated with each
release, whereas the update of private projects must be
requested by the user.

Results and graphics download 

All sequence datasets used in a Metavir analysis are avail- 
able for download in fasta format (affiliated and uncharac- 
terized sequences, sequences included in phylogenetic trees 
and sequences included in recruitment plots). All tables 
(taxonomic heatmap, contig and ORF affiliations, results 
for recruitment analysis) can be downloaded as csv files
that can be imported into spreadsheet software.

Contig annotations are available in GenBank file format, 
which can be used in many downstream tools like Artemis 
[47] or Easyfig [48]. These GenBank files contain the lowest 
common ancestor affiliation of the contig, as well as the 
best BLAST hit affiliation of each ORF, the functional an- 
notation of each ORF in PFAM domain, and the sequences 
of each predicted CDS. 

All interactive charts and pictures (contig maps, contig 
comparisons, phylogenetic trees) can be downloaded in svg 
format, a publication-ready vector format that is easy to modify
using graphics software. Static charts generated with R are
available to download in pdf and png file format. 

Finally, the contig networks can be downloaded in a set 
of different formats, including graphml and xgmml, ready 
to be imported in the desktop version of Cytoscape for 
further analyses and annotations. 

Case study: using Metavir to analyze the human gut
virome 

Two different datasets from the human gut viral commu- 
nity were chosen to illustrate the results that can be ob- 
tained with Metavir 2. First, a set of 16 viromes was used to 
illustrate the section dedicated to unassembled datasets 
([49]; project "Human Gut Diet" on Metavir). These meta- 
genomes, sequenced with 454 GS Titanium (884,628 reads 
of 350 bp/310 Mb), were initially designed to study the
dynamics of the human gut viral community during a
perturbation by a dietary intervention. Two individuals were fed a
high fat/low fiber diet (H1 and H2), three were fed a low
fat/high fiber diet (L1, L2 and L3) and one was on an ad-lib diet
(X). Samples were collected at up to four time points (days 
1, 2, 7 and 8). The second dataset is an assembled virome, 
resulting from the assembly of Illumina Hi-Seq 2000 reads 
(5.6 Gb of 100 bp reads) from healthy individuals ([16]; vir- 
ome "Human gut - All subjects" from project "Human Gut 
Assembly" on Metavir). This assembled dataset was used 
here to illustrate the possibilities offered by the new section 
dedicated to the analysis of contigs. 

Results and discussion 

Metavir, a web server dedicated to the analysis of viromes 
uploaded by registered users, can now be used to analyze
the two existing types of datasets: (i) viromes composed of
raw reads, mostly generated using pyrosequencing tech- 
nology and (ii) viromes assembled into contigs, a strategy 
possible with datasets sequenced with either pyrosequen- 
cing or Illumina technology. The novelties of version 2 of 
Metavir will be illustrated here using both types of data- 
sets (unassembled 454 reads [49] and Illumina assembled 
contigs [16]), all from human gut samples. 

Additions to the unassembled datasets section 

Most published viral metagenomes are still analyzed at 
the read level. Indeed, pyrosequencing technology is 
often chosen to generate viromes, as this technology 
produces long reads and several samples can be easily 
multiplexed in a single run. Thus, the number of reads 
in each multiplexed dataset is generally insufficient to 
produce an assembly. Furthermore, the multiple datasets 
generated make it possible to study spatial or temporal 
dynamics in environmental communities [10,22,50-52] 
or different individuals subjected to different conditions 
for eukaryote-associated viromes (e.g. different diets in 
[49]). In this context, the comparison of multiple data- 
sets was our major focus while extending the section 
dedicated to unassembled datasets. In addition to the 
rarefaction curves and reciprocal tBLASTx comparison 
available in the initial version of Metavir, taxonomic 
compositions and phylogenetic analyses can now be 
used to compare viromes. Furthermore, most of these 
tools were improved with special attention to the display 
of results. A brand new tool was also added: the recruit- 
ment plot analysis, which makes it possible to accurately 
study the similarities between virome reads and a viral 
genome of interest. 

Taxonomic composition 

Taxonomic composition of viromes is determined by se- 
quence similarity between virome reads and complete 
known viral genomes, and can be displayed as either raw 
number of hits or number of hits normalized by genome 
length [32]. Virome composition can now be visually 
compared in two ways: (i) merging multiple composi- 
tions on the same Krona chart [33] and (ii) an in-house 
developed interactive heatmap, which allows a more 
hierarchical view. As an example of the latter, a taxo- 
nomic heatmap was generated for the 16 datasets from 
the human gut (Figure 1). This heatmap allows the user 
to quickly visualize that these datasets only exhibited 
similarities with bacteriophages, in accordance with the 
results presented in Minot et al. ([49], Figure two c). 
Even when the same bacteriophage groups are found in
the different datasets, their proportions differ between
viromes: Myoviridae constitute between 11 and 42%
of each virome, Podoviridae 2 - 35%, Siphoviridae
24 - 55% and Microviridae 0 - 31%.

Figure 1 Taxonomic composition (best hit ratios) of the 16 unassembled datasets from the human gut viromes from Minot et al. ([49]).

Viral species are classified according to the NCBI taxonomy, and taxonomic groups can be folded or unfolded with a mouse click. Columns have 
been re-ordered through mouse drag and drop to gather datasets from each subject. Samples are named according to the diet (X: ad-lib diet, 
H: high fat/low fiber diet, L: low fat/high fiber) of 6 subjects (X, L1, L2, L3, H1, H2) and to the day of the sample collection after the beginning of 
the experiment (d1, d2, d7 and d8). 



k-mer frequency bias 

A recurrent observation in analyses of virome data is that 
the majority of reads has no similarity to any known viral 
sequence [6], as can be noted for human gut viromes (top 
of Figure 1). Therefore, methods that consider viromes in 
their entirety rather than only the small fraction affiliated 
with known sequences are of particular interest. Analysis 
of k-mer nucleotide frequency bias is such a method and 
has been shown to distinguish viromes from different biomes.
This analysis, now available in Metavir, was here applied 
to the 16 human gut datasets using 4-mer nucleotides 
(tetranucleotides) and a non-metric multidimensional 
scaling (Figure 2). Results are again similar to those
obtained in Minot et al. ([49], Figure five A): even though
viral communities seem to be affected by diet (X, H, L),
the different samples from each subject (X1, H1, H2, L1,
L2 and L3) cluster together, indicating that each individual
contained a unique virome. However, the k-mer analysis
does not support the conclusion that viromes from sub- 
jects on the same diet converge over time. 



Phylogenetic analyses 

Phylogenetic analysis is of particular interest to study 
specific viral groups and such analysis was implemented 
in the first version of Metavir [24]. As no gene is com- 
mon to all viruses, several marker genes are needed 
to study the major viral groups. The list of markers, 
initially made of 8 genes, has been expanded to 13 
markers, mostly following users' requests. In Metavir 1, 
reads from a chosen virome detected as homologous to 
a selected marker were used to compute a tree including 
both these virome reads and reference sequences. How- 
ever, the lack of reference strains close to most environ- 
mental viruses limits the efficacy of such analyses and 
often results in the generation of environmental clades 
far from references. Nevertheless, samples from similar
biomes often harbor closely related viruses [5,11,52]. To 
gain a better view of the diversity in each sample and of 
the relationships between samples, Metavir 2 now offers 
the opportunity to compute phylogenetic trees that
include reads from other viromes. As an example, we
conducted such an analysis on the Picovirinae, a
subfamily of Podoviridae (Maximum-likelihood tree
computed with FastTree, with default parameters). Indeed,
this group is one of the most abundant in 5 of the 16
human gut viromes (Figure 1). A protein-primed DNA
polymerase, conserved in this family, was used to
determine the phylogenetic relationships of the viruses
retrieved in these human gut viromes (Figure 3). As
expected, all sequences retrieved are most closely related
to bacteriophages, and no virome reads appear to be
linked to either archaeal (Salterprovirus) or eukaryotic
viruses (Adenoviridae). Interestingly, virome sequences
from each individual are clustered on the tree,
highlighting that the Picovirinae-like phages of subject L2 are
distinct from those of H1. Such specificity of viral strains
to each individual was noted on a more general scale 
through virome analysis of genetically linked individuals 
[28]. In this example, phylogenetic analysis of an abun- 
dant viral family confirmed the conclusions drawn from 
the comparisons of whole viromes. 



Individual viral genome recruitment plots 

Besides the analysis of single reads through BLAST or 
phylogenetic tools, plots of metagenomic sequences re- 
cruited by reference genomes of interest can give a sense 
of how well this genome is represented in a metagenome 
(see for example [53]). Indeed, visualizing a chosen gen- 
ome and the distribution of its associated reads is useful 
to determine which genes of a known virus are found in 
an environmental dataset and the similarity level between 
reference and virome sequences. Recruitment plots can be 
generated in Metavir, and here again, several datasets can 
be included in a single plot in order to compare the gene 
conservation of a virus in different samples. As an ex- 
ample, this tool was here used to further study Lactococ- 
cus phage 1706, one of the most abundant phages in the 
16 datasets from the human gut. As this phage has been 
isolated from bacteria involved in milk fermentation and 
not directly from gut microbes, its actual presence in 
human gut samples is questionable. The plot of virome 
reads recruited by Lactococcus phage 1706 shows that
most characterized genes (coding for the main functions
of the genome, i.e. replication and structure module,
highlighted in red on the plot) are retrieved whereas most
of the unknown genes (in blue) are not (Figure 4). This
suggests that even though phage 1706 is the nearest
neighbor of abundant human gut phage(s) in the current
state of the reference databases, these gut phages do not
have a gene content entirely similar to phage 1706.
Furthermore, a gene cassette made of two putative tail
proteins and two other structural proteins known to be
major players of phage-host specificity in phage 1706 is
scarcely retrieved in these datasets ([54]; black frame on
Figure 4). Thus, it is very likely that the phages
retrieved in the human gut viromes, even though similar to
this Lactococcus phage, infect an alternative host. This 
example illustrates how recruitment plots help in further 
understanding the genomic content of environmental 
viruses and their genomic relatedness with known viruses. 

Analyzing assembled datasets using the new contig 
section 

Even though unassembled viromes proved to be useful 
for a better characterization of environmental viral
communities, long genomic fragments generated through
the assembly of metagenomic datasets are usually more
informative. Indeed, complete ORFs predicted from such
contig sequences (i) are more often similar to known
viruses than short reads [55], (ii) provide more robust
phylogenies than reads representing only a portion of a
gene, and (iii) are more appropriate than short random 
reads in determining the gene content and genetic di- 
versity of a viral community [56]. Moreover, analysis of 
the genomic content and architecture can provide decisive 
insights into virus classification and evolution of viral 
groups [20]. 

A new section dedicated to the annotation and navi- 
gation within sets of contigs has therefore been imple- 
mented in Metavir. When assembled viromes, i.e. sets of 
contigs, are uploaded by users, ORFs are predicted [41] 
and then annotated using sequence similarity results 
against viral genomes and protein domains. In addition to 
the general taxonomic composition, contig maps and
annotations can be displayed for every contig. As datasets
can consist of tens of thousands of contigs, users can choose
to visualize contigs (i) longer than a defined threshold,
(ii) predicted as circular or linear, (iii) affiliated to a
particular viral family, and/or (iv) possessing a particular
gene. Finally, tools available for read analysis were
specifically adapted to assembled datasets: taxonomic
compositions are computed using either gene or contig affiliation,
phylogenies are generated using predicted ORFs and
genetic diversity is computed using either predicted ORFs or
domain conservation.

Figure 4 Recruitment plot of Lactococcus phage 1706 for 6 human gut datasets. Only the 6 viromes with more than 200 reads recruited by
this genome were included, i.e. reads having their best BLAST hit with this genome. Stacked histograms represent the total number of reads
similar to each 500-nt long genome part, with a different color for each virome. Each gene is plotted as a rectangle on the genome map of the
Lactococcus phage 1706 at the bottom, with hypothetical proteins in blue, and characterized genes in red. A black frame highlights the three
genes involved in host specificity.

For the assembled human gut virome used as an example 
in this section ("Human gut - All subjects" in Metavir), 
43,078 ORFs were predicted on the 10,202 uploaded con- 
tigs. Furthermore, 60 contigs were predicted as circular 
and represent potential complete viral genomes. Using the 
"contig selection" panel, large contigs (>15kb) similar to 
Lactococcus phage 1706 were selected and further exa- 
mined. For each selected contig, a summary of its anno- 
tations is available as an interactive map. The largest 
sequence (contig_187_43, 60,257 bp) seems to be com- 
posed of two sets of genes associated with known viral ge- 
nomes (green genes at both ends of the contig), whereas a 
third and central part is made of shorter and uncharac- 
terized genes (red genes) (Figure 5). All genes but three 
are on the same strand (-), as is generally observed in
phage genomes. Moreover, no partial gene is predicted at
either end of the sequence, indicating that this contig may 
represent a complete genome. 

Relationships between selected contigs and viral refe- 
rences to which they are affiliated can be displayed as an 
interactive network, where contigs and reference genomes 
are represented by nodes and sequence similarities as 
edges. For example, the network containing contigs asso- 
ciated with Lactococcus phage 1706 helps to rapidly iden- 
tify that these contigs are related both to each other and 
to several Siphoviridae genomes (Figure 6A). Contigs and 
references can then be selected in this network and a 
genome comparison of the chosen sequences can be 
displayed. This map-to-map comparison allows the user 
to identify collinearity between different genomes or geno- 
mic fragments. When compared to the complete genome 
of Lactococcus phage 1706, contig_187_43 can definitely 
be considered as a putative complete genome closely 
related to this phage, as both their sizes and gene organi- 
zations are very similar (Figure 6B). Interestingly, the 
similarities between this contig and Clostridium phage 
phiCD6356 are limited to two genes which are part of the
host-associated cassette previously discussed. Thus,
contig_187_43 likely originates from a phage closely related to
Lactococcus phage 1706, but which could instead infect
members of the Clostridium genus. The second contig
displayed on Figure 6B, contig_298_22.4, only shares one
core gene module with phage 1706 and harbors several
similarities to a distinct Clostridium phage. These two
contigs, which both exhibit similarities to Lactococcus phage
1706, are here shown to be heterogeneous in nature.
Furthermore, genes of contig_187_43 similar to Lactococcus
phage 1706 correspond to the genes frequently retrieved
in unassembled datasets (Figure 4), indicating that this
contig might represent a prevalent virotype of the human
gut. This genomic analysis of large assembled sequences
exemplifies how such datasets can provide further insights
into viral communities and viral species.

Figure 5 Automatic annotation of contig 187_43 (complete name: 1470_2012_5M_iter1_k63_scaffold187_43.0). On top, a dynamic map
displays the predicted genes colored by affiliation (green for genes affiliated to RefSeq Virus, yellow when only PFAM affiliation is available, and
red for uncharacterized genes). The associated gene table (below) displays for each gene the accession number and annotation of the most
similar gene in RefSeq Virus and in PFAM (when available).

Conclusion 

This new release of Metavir provides a wide range of 
tools to analyze either raw or assembled viral metagen- 
omes in a comprehensive way. As virome projects now 
regularly encompass multiple samples and as more and 
more viromes are being published, a special effort was 
made towards virome comparison. Two new large-scale
methods were implemented and all existing Metavir
tools were modified so that they can be used to compare
datasets. Furthermore, a new section has been
specifically developed to handle sets of large genomic contigs.
As these datasets can be large and as all individual
sequences can be of interest, we paid special attention to the
interface, with filtering panels and network visualization.
Selected contigs can then be analyzed in detail by
comparing their automatic annotations in terms of gene content
and genomic maps. Finally, with its extended or new tools
and sections, Metavir 2 provides a comprehensive
framework with a user-friendly interface to explore any kind of
virome, and should help virologists to make the most of
their metagenomics data.

Figure 6 Contig comparison through network and genome map comparison. A. Contig network including 6 contigs affiliated to
Lactococcus phage 1706. Contigs and reference genomes are displayed as nodes, and BLAST similarities are displayed as edges. In this
network, we chose to color nodes according to the taxonomy of the reference genomes, and to keep links between nodes only when two
genes or more were found to be similar between the two sequences. B. Map comparison for contigs and genomes selected in the network
(highlighted in yellow in A). The maps of these five selected sequences are vertically stacked, and BLAST hits between genes of two consecutive
maps are depicted with gray frames. Sequences were re-ordered to display similarities between Lactococcus phage 1706 and the two contigs, as
well as similarities between these contigs and Clostridium phages. In both network and map comparison, the contig names were simplified:
complete name of contig 187_43 is 1470_2012_5M_iter1_k63_scaffold187_43.0, contig 298_22.4 is 1470_2012_5M_iter2_k47_scaffold298_22.4,
contig 334_19.8 is 1470_2012_5M_iter2_k47_scaffold334_19.8, contig 1977_14.5 is 1470_1013_5M_iter6_k39_scaffold1977_14.5, contig 271_28.5
is 1470_2012_5M_iter2_k47_scaffold271_28.5, and contig 1957_11.1 is 1470_1013_5M_iter6_k39_scaffold1957_11.1.